Detect Medicare Fraud in the US through Classification Model

Medical fraud is a major contributor to inflating expenses and waste in the US healthcare system. Forbes identifies the industry as among the most vulnerable to fraud & cyber-attacks. This investigation explores physician-level Medicare and Medicaid data through exploratory data analysis and a classfication model, hoping to show how big data can be used to prevent this problem.

The inspiration for this project came from my coleague at IBM, whose mother was a victim of medical ID theft. Although the problem of ID theft is slightly different (would require transctional data), CMS and LEIE data are among the best publicly available sources for healthcare and thus were used as a proof-of-concept. It reaffirms that, similar to credit card fraud detection, fraudulent practices in healthcare can be prevented and alleviated. The most vulnerable demographics, dependent elders of 65 years and older, would benefit the most from this technology.

Healthcare fraud in the news:

Data sources:

In cleaning and combining the two datasets, I followed closely previous literature, which are both listed in the "References" section.

In [1]:
import os
import pandas as pd
In [2]:
# Set up data directory
CWD = os.getcwd()
cms_data_dir = os.path.join(CWD, 'CMSData')
In [3]:
# Some years columns are capitalized and other years the columns are lowercase:
capitalization_dict = {
    '2012': str.upper,
    '2013': str.upper,
    '2014': str.lower,
    '2015': str.lower,
    '2016': str.upper,
    '2017': str.lower,
}

1. CMS Part B dataset

In [4]:
# Set dtypes based on https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/...
#Medicare-Provider-Charge-Data/Physician-and-Other-Supplier2017
partB_dtypes = {
    'npi': 'str',
    'nppes_provider_last_org_name': 'str',
    'nppes_provider_first_name': 'str',
    'nppes_provider_mi': 'str',
    'nppes_credentials': 'str',
    'nppes_provider_gender': 'str',
    'nppes_entity_code': 'str',
    'nppes_provider_street1': 'str',
    'nppes_provider_street2': 'str',
    'nppes_provider_city': 'str',
    'nppes_provider_zip': 'str',
    'nppes_provider_state': 'str',
    'nppes_provider_country': 'str',
    'provider_type': 'str',
    'medicare_participation_indicator': 'str',
    'place_of_service': 'str',
    'hcpcs_code': 'str',
    'hcpcs_description': 'str',
    'hcpcs_drug_indicator': 'str',
    'line_srvc_cnt': 'float64',
    'bene_unique_cnt': 'float64',    
    'bene_day_srvc_cnt': 'float64',
    'average_medicare_allowed_amt': 'float64',
    'average_submitted_chrg_amt': 'float64',
    'average_medicare_payment_amt': 'float64',
    'average_medicare_standard_amt': 'float64',
}
In [5]:
# Get dfs for all years - TAKE A FEW MINUTES
years = ['2012','2013','2015','2016']
dfs   = []

for year in years:
    file = os.path.join(cms_data_dir, f'cms{year}.txt')
    dtypes = dict(zip(list(map(capitalization_dict[year], partB_dtypes.keys())), list(partB_dtypes.values()))) #get correct column capitalization and dtype
    df = pd.read_csv(file, delimiter='\t', dtype=dtypes)
    df.columns = map(str.lower, df.columns)  # make all variable names lowercase
    df['year'] = year #add Year column 
    dfs.append(df)
In [6]:
# Concatenate
partB_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
partB_df.shape
Out[6]:
(37653939, 30)
In [7]:
# Remove rows corresponding to drugs because LINE_SRVC_CNT for them is not a desirable count
partB_df = partB_df[(partB_df['hcpcs_drug_indicator'] == 'N')]
partB_df.shape
Out[7]:
(35368294, 30)
In [8]:
# Drop missing NPI and HCPCS - "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
# This means dropping 2014 and 2016 - both did not have HCPCS Code
partB_df = partB_df.dropna(subset = ['npi','hcpcs_code'])
partB_df.shape
Out[8]:
(35368294, 30)
In [9]:
# Keep variables based on "Medicare fraud detection using neural networks" (Johnson, Khoshgoftaar 2019)
partB_variables_to_keep = [
    'npi',
    'provider_type',
    'nppes_provider_city', # keep
    'nppes_provider_zip', # keep
    'nppes_provider_state', # keep
    'nppes_provider_country', # keep
    'hcpcs_code',  # not in paper but kept
    'hcpcs_description',  # not in paper but kept
    'hcpcs_drug_indicator',  # not in paper but kept
    'place_of_service',  # not in paper but kept
    'nppes_provider_gender',
    'line_srvc_cnt',
    'bene_unique_cnt',
    'bene_day_srvc_cnt',
    'average_submitted_chrg_amt',
    'average_medicare_payment_amt',
    'year' # need Year for labeling
]
partB_df = partB_df[partB_variables_to_keep]
In [10]:
partB_df.head()
Out[10]:
npi provider_type nppes_provider_city nppes_provider_zip nppes_provider_state nppes_provider_country hcpcs_code hcpcs_description hcpcs_drug_indicator place_of_service nppes_provider_gender line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt year
1 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99222 Initial hospital inpatient care, typically 50 ... N F M 115.0 112.0 115.0 199.0 108.115652 2012
2 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99223 Initial hospital inpatient care, typically 70 ... N F M 93.0 88.0 93.0 291.0 158.870000 2012
3 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99231 Subsequent hospital inpatient care, typically ... N F M 111.0 83.0 111.0 58.0 30.720721 2012
4 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99232 Subsequent hospital inpatient care, typically ... N F M 544.0 295.0 544.0 105.0 56.655662 2012
5 1003000126 Internal Medicine CUMBERLAND 215021854 MD US 99233 Subsequent hospital inpatient care, typically ... N F M 75.0 55.0 75.0 150.0 81.390000 2012
In [11]:
partB_df.loc[partB_df['npi'] == '1003000142'][['npi',
                                             'provider_type',
                                             'place_of_service',
                                             'line_srvc_cnt',
                                             'average_submitted_chrg_amt',
                                             'year']][:5]
Out[11]:
npi provider_type place_of_service line_srvc_cnt average_submitted_chrg_amt year
16 1003000142 Anesthesiology O 28.0 216.571429 2012
17 1003000142 Anesthesiology O 24.0 111.000000 2012
9153288 1003000142 Anesthesiology F 56.0 483.000000 2013
9153289 1003000142 Anesthesiology F 16.0 1105.000000 2013
9153290 1003000142 Anesthesiology F 22.0 709.136364 2013
In [12]:
partB_df['year'].value_counts()
Out[12]:
2016    9104337
2015    8904316
2013    8732934
2012    8626707
Name: year, dtype: int64
In [13]:
# Write all combined CMS to csv
#partB_df.to_csv('combined-partB-data-v2')

2. LEIE Dataset

In [14]:
leie_data_dir = os.path.join(CWD, 'LEIEData')
In [15]:
leie_dtypes = {
    'LASTNAME': 'str',
    'FIRSTNAME': 'str',
    'MIDNAME': 'str',
    'BUSNAME' : 'str',
    'GENERAL': 'str',
    'SPECIALTY': 'str',
    'UPIN': 'str',
    'NPI': 'str',
    'DOB': 'str',
    'ADDRESS': 'str',
    'CITY': 'str',
    'STATE': 'str',
    'ZIP': 'str',
    'EXCLTYPE': 'str',
    'EXCLDATE': 'int64',
    'REINDATE': 'int64',
    'WAIVERDATE': 'int64',
    'WVRSTATE': 'str',
}
In [16]:
#LEIE data is monthly between 01/2018 (1801) - 12/2019 (1912)
year_months = ['1801','1802','1803','1804','1805','1806','1807','1808','1809','1810','1811','1812',
            '1901','1902','1903','1904','1905','1906','1907','1908','1909','1910','1911','1912']
dfs = []

for year_month in year_months:
    file = os.path.join(leie_data_dir, f'leie{year_month}-excl.csv')
    df   = pd.read_csv(file, dtype=leie_dtypes)
    df.columns = map(str.lower, df.columns)
    dfs.append(df)
In [17]:
# Concatenate
leie_df = pd.concat(dfs, axis=0, ignore_index=True, sort=False)
leie_df.shape
Out[17]:
(4983, 18)
In [18]:
leie_df.head()
Out[18]:
lastname firstname midname busname general specialty upin npi dob address city state zip excltype excldate reindate waiverdate wvrstate
0 NaN NaN CHANGING STEPS TREATMENT CENTE OTHER BUSINESS COMMUNITY HLTH CTR ( NaN 1477704351 NaN 14540 HAMLIN ST, STE B VAN NUYS CA 91411 1128a1 20180220 0 0 NaN
1 NaN NaN OLIVE TREE FOSTER HOME OTHER BUSINESS ADULT HOME NaN 0000000000 NaN 94-245 PUPUKOAE STREET HONOLULU HI 96797 1128a1 20180220 0 0 NaN
2 AIRHART LAURA PAULINE NaN IND- LIC HC SERV PRO NURSE/NURSES AIDE NaN 0000000000 19770704 7304 FULLER CIRCLE FORT WORTH TX 76133 1128b4 20180220 0 0 NaN
3 ALBERT AMY NaN IND- LIC HC SERV PRO PHYSICIAN'S ASSISTAN NaN 1679639397 19770818 1124 GAINSBORO ROAD LOWER MERION PA 19004 1128b4 20180220 0 0 NaN
4 ALLEN HEATHER ANITRA NaN IND- LIC HC SERV PRO NURSE/NURSES AIDE NaN 0000000000 19740326 1004 BINGHAM AVE ROWAN IA 50470 1128a3 20180220 0 0 NaN
In [19]:
# Drop NPI = 0, which means missing - A LOT ARE MISSING, which is a problem for the data
leie_df = leie_df[leie_df['npi'] != 0]
leie_df.shape
Out[19]:
(4983, 18)
In [20]:
# Keep exclusions most related to Fraud
exclusions_to_keep = [
    '1128a1',
    '1128a2',
    '1128a3',
    '1128b4',
    '1128b7',
    '1128c3Gi',
    '1128c3gii',
]
leie_df = leie_df[leie_df['excltype'].isin(exclusions_to_keep)]
leie_df.shape
Out[20]:
(4294, 18)
In [21]:
leie_df['excltype'].value_counts()
Out[21]:
1128a1    1988
1128b4    1308
1128a3     523
1128a2     425
1128b7      50
Name: excltype, dtype: int64

Type definitions of physician fraud:

  • 1128a1: Conviction of program-related crimes
  • 1128a2: Conviction of relating to patient abuse or neglect
  • 1128a3: Felony conviction relating to healthcare fraud
  • 1128a4: License revocation, suspension, or surrender
  • 1128a3: Fraud, kickbacks, other prohibited activities
In [22]:
# Write all combined LEIE to csv
#partB_df.to_csv('combined-leie-data')

3. Combine/Label Data

In [23]:
from datetime import datetime, timedelta
import numpy as np
In [24]:
# Convert to datetime
leie_df['excldate'] = pd.to_datetime(leie_df['excldate'], format='%Y%m%d', errors ='ignore')
In [25]:
# Round excl date to the nearest year Johnson & Khoshgoftaar (2019)
def round_to_year(dt=None):
    year = dt.year
    month = dt.month
    if month >= 6:
        year = year + 1
    return datetime(year=year,month=1,day=1)

leie_df['excl_year'] = leie_df.excldate.apply(lambda x: round_to_year(x))
In [26]:
# Make exclusion dict 
# 1215053665 has 2 exclusions, so sort also df to get latest year
excl_year_dict = dict([npi, year] for npi, year in zip(leie_df.sort_values(by='excl_year').npi, leie_df.sort_values(by='excl_year').excl_year))
In [27]:
# Get label as 0 or 1
partB_df['excl_year'] = partB_df['npi'].map(excl_year_dict)
partB_df['excl_year'] = partB_df['excl_year'].fillna(datetime(year=1900,month=1,day=1)) # fill NaN, physicians without exclusion, with year 1900

partB_df['year'] = pd.to_datetime(partB_df['year'].astype(str), format='%Y', errors ='ignore')
partB_df['fraudulent'] = np.where(partB_df['year'] < partB_df['excl_year'], 1, 0) # compare year vs. exclusion year to get Fraudulent
In [28]:
print("partB_df is our combined dataset with shape: {0}".format(partB_df.shape))
partB_df is our combined dataset with shape: (35368294, 19)

4. Draw Visualizations

In [29]:
%matplotlib inline

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.figure_factory as ff
import plotly.graph_objects  as go
from plotly.subplots import make_subplots
In [30]:
# Get number and amount of fraudulent services
partB_df['fraud_line_srvc_cnt'] = partB_df['line_srvc_cnt']*partB_df['fraudulent']
partB_df['fraud_average_submitted_chrg_amt'] = partB_df['average_submitted_chrg_amt']*partB_df['fraudulent']
partB_df['fraud_average_medicare_payment_amt'] = partB_df['average_medicare_payment_amt']*partB_df['fraudulent']
In [31]:
# Aggregate by state
state_df = partB_df.groupby('nppes_provider_state').agg({
    'line_srvc_cnt':[('total_services_count','sum')],
    'fraud_line_srvc_cnt':[('total_fraud_services_count','sum')]
}).reset_index()

# Drop multi-index
state_df.columns = ['_'.join(col) for col in state_df.columns]
state_df.columns = ['provider_state', 'total_services_count', 'fraud_services_count']
In [32]:
# Get % fraud
state_df['fraud_services_pct'] = state_df['fraud_services_count']/state_df['total_services_count']
state_df.head()
Out[32]:
provider_state total_services_count fraud_services_count fraud_services_pct
0 AA 26875.0 0.0 0.000000
1 AE 45118.0 0.0 0.000000
2 AK 7096804.1 0.0 0.000000
3 AL 145179172.2 27972.0 0.000193
4 AP 35607.0 0.0 0.000000
In [33]:
np.seterr(divide = 'ignore') 

fig = go.Figure(data=go.Choropleth(
    locations=state_df['provider_state'],
    z = np.log(state_df['fraud_services_pct'].astype(float))+0.000001, #log-scale
    locationmode = 'USA-states',
    colorscale = 'Reds',
    colorbar_title = "Logged %",
    marker_line_color='white'
))

fig.update_layout(
    title_text = 'Logged Percentage of Medicare Fraudulent Service by US State (2012-2016)',
    geo_scope='usa',
)

fig.show()

Heat map shows a high concentration of fraudulent Medicare claims around the Northeast and South of the US, with Texas leading all states by a wide margin.

In [34]:
# Get dummy variables for gender
partB_df2 = pd.concat([partB_df, pd.get_dummies(partB_df['nppes_provider_gender'])], axis=1)
In [35]:
# Aggregate by provider type
type_df = partB_df2.groupby('provider_type').agg({
    'line_srvc_cnt':['sum'],
    'fraud_line_srvc_cnt':['sum'],
    'M':['sum'],
    'F':['sum'],
    'average_submitted_chrg_amt':['median'], #since distribution skewed right
    'average_medicare_payment_amt':['median'],
    'fraud_average_submitted_chrg_amt':['max'],
    'fraud_average_medicare_payment_amt':['max']
}).reset_index()

# Drop multi-index
type_df.columns = ['_'.join(col) for col in type_df.columns]
type_df.columns = ['provider_type', 'total_services_count', 'fraud_services_count','male_count','female_count',
                   'avg_submitted_chrg_amt','avg_medicare_payment_amt', 'fraud_avg_submitted_chrg_amt','fraud_avg_medicare_payment_amt']
In [36]:
# Sorting
type_df = type_df.sort_values('fraud_services_count',ascending=False)[:15] #get top 15 fraudulent types
type_df = type_df.sort_values('total_services_count',ascending=True).reset_index(drop=True) #re-sort by total services

# Add some fields
type_df['non_fraud_services_count'] = type_df['total_services_count'] - type_df['fraud_services_count']
type_df['fraud_services_pct'] = (type_df['fraud_services_count']/type_df['total_services_count'])*100
type_df.head()
Out[36]:
provider_type total_services_count fraud_services_count male_count female_count avg_submitted_chrg_amt avg_medicare_payment_amt fraud_avg_submitted_chrg_amt fraud_avg_medicare_payment_amt non_fraud_services_count fraud_services_pct
0 Geriatric Medicine 10738563.3 52334.0 42616.0 34393.0 155.594615 62.206552 800.000000 178.185878 10686229.3 0.487346
1 Pain Management 14888967.2 96178.0 90837.0 11511.0 300.000000 68.560030 1980.000000 288.390714 14792789.2 0.645968
2 General Practice 26465791.9 114533.0 150038.0 33452.0 94.000000 41.020000 603.641975 201.138500 26351258.9 0.432759
3 Anesthesiology 44806223.5 189943.0 733834.0 155457.0 718.158333 85.868923 2722.142857 547.420385 44616280.5 0.423921
4 Neurology 54613299.3 108023.0 398365.0 126310.0 221.935065 77.823478 2634.000000 604.060909 54505276.3 0.197796
In [37]:
# Get 2015 data only for speed
partB_2015_df = partB_df2[partB_df2['year'] == datetime(year=2015,month=1,day=1)]
top_7_types  = type_df['provider_type'][:7].tolist()

for p_type in top_7_types:
    
    #Eliminate 0 payments then log
    x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_submitted_chrg_amt!=0)]['average_submitted_chrg_amt'])
    fraud_x = np.log(partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_submitted_chrg_amt!=0)]['fraud_average_submitted_chrg_amt'])
    
    fig,ax = plt.subplots(figsize=(12,5))
    sns.distplot(x,label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x,label='Fraudulent', hist=False, rug=False)

    ax.set(
        title  ='Distribution of Submitted Charge Amount - Fraud vs. Non-fraud - '+ p_type,
        xlabel = 'Log of USD Payment Amount',
        ylabel = 'Count'
          )
    
plt.show()

As shown in histogram comparisons, no obvious evidence to say that the average fraudulent submitted charge is more expensive than non-fraudulent charge across the Top 15 Fraudulent categories.

This does not support the hypothesis I previously had - perhaps because doctors are pretty good at hiding wrongful claims and not making it obvious by listing unreasonably high costs.

In [38]:
for p_type in top_7_types:
    #Eliminate 0 payments
    x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.average_medicare_payment_amt!=0)]['average_medicare_payment_amt']
    fraud_x = partB_2015_df[(partB_2015_df.provider_type == p_type) & (partB_2015_df.fraud_average_medicare_payment_amt!=0)]['fraud_average_medicare_payment_amt']
    
    fig,ax = plt.subplots(figsize=(12,5))
    sns.distplot(x,label='Non-fraudulent', hist=False, rug=False)
    sns.distplot(fraud_x,label='Fraudulent', hist=False, rug=False)
    ax.set(
        title  ='Distribution of Medicare Payment Amount - Fraud vs. Non-fraud - '+ p_type,
        xlabel = 'Log of USD Payment Amount',
        ylabel = 'Count'
          )
    plt.show()

Same conclusion can be made here about Medicare Payment amount.

In [39]:
partB_2015_df = partB_df[partB_df['year'] == datetime(year=2015,month=1,day=1)]

fig, ax = plt.subplots(figsize=(15,7))

sns.heatmap(partB_2015_df.corr(method='pearson'), annot=True, fmt='.4f', 
            cmap=plt.get_cmap('coolwarm'), cbar=False, robust=True)
ax.set_yticklabels(ax.get_yticklabels(), rotation="horizontal")
ax.set(
    title  ='Correlation Heatmap in 2015 (only Continuous Variables)',

      )
plt.show()

No insights are particularly interesting here, outside of the correlations we would normally expect.

In [40]:
# Get categorical variables (except location)
for col in ['nppes_provider_gender','provider_type']:
    partB_2015_df = pd.concat([partB_2015_df, pd.get_dummies(partB_2015_df[col], drop_first= True)], axis=1)
    partB_2015_df = partB_2015_df.drop(col, 1)
In [42]:
def get_redundant_pairs(df):
    '''Get duplicate pairs to drop in correlation matrix after unstacking'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_correlations(df):
    '''Get biggest correlations'''
    au_corr = df.corr().unstack() #unstack
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop)
    return au_corr

#get relevant columns
cols = partB_2015_df.columns[20:].tolist() + ['line_srvc_cnt','bene_unique_cnt','average_submitted_chrg_amt','fraudulent']
corr = get_correlations(partB_2015_df[cols])
In [43]:
corr = corr.to_frame().reset_index() #to frame
corr.columns = ['Variable A', 'Variable B', 'Correlation']

def color_df(value):
    '''Color red if positive, green if negative'''
    if value < 0:
        color = 'red'
    elif value > 0:
        color = 'green'
    else:
        color = 'black'
    return 'color: %s' % color

corr = corr.reindex(corr['Correlation'].abs().sort_values(ascending=False).index).reset_index(drop=True)
corr[:15].style.applymap(color_df, subset=['Correlation'])
Out[43]:
Variable A Variable B Correlation
0 line_srvc_cnt bene_unique_cnt 0.514602
1 Ambulatory Surgical Center average_submitted_chrg_amt 0.282647
2 M Nurse Practitioner -0.270379
3 M Clinical Laboratory -0.179634
4 Diagnostic Radiology Internal Medicine -0.146244
5 Diagnostic Radiology Family Practice -0.128436
6 M Diagnostic Radiology 0.123625
7 Clinical Laboratory bene_unique_cnt 0.121128
8 Family Practice Internal Medicine -0.119511
9 M Physician Assistant -0.117444
10 M Ambulatory Surgical Center -0.117421
11 M Orthopedic Surgery 0.107368
12 Anesthesiology average_submitted_chrg_amt 0.106800
13 M Mass Immunization Roster Biller -0.105435
14 M Cardiology 0.103208

Again, too many interesting conclusions can be made here, apart from the fact that some medical fields/types are dominated by certain gender. The very top, Line Service Count (number of services performed) and Unique Beneficiary Count (number of distinct beneficiaries), make sense as having a positive correlation over 0.5.

Most notable is that Ambulatory Surgical Center and Anesthesiology are quite "expensive" and very positively correlated to Submitted Charge Amount -- an important insight as they are among the high flyers of physician fraud.

In [44]:
fig = go.Figure()

col_layout_dict = {'non_fraud_services_count': ['Non-fraud Service','rgba(50, 171, 96, 0.6)'],
                 'fraud_services_count': ['Fraud Service','rgb(255, 0, 0)']} #dict for layout

for col in ['non_fraud_services_count','fraud_services_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.95],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.95],
    ),
    xaxis_title_text='Count',
)

annotations = [] #annotate with %

x   = type_df['fraud_services_count']+type_df['non_fraud_services_count']+100000000
y_p = np.round(type_df['fraud_services_pct'].tolist(), decimals=2)

for y_p, x, y in zip(y_p,x,type_df['provider_type']):
    annotations.append(dict(xref='x1', yref='y1',
                            y=y, x=x,
                            text=str(y_p) + '%',
                            font=dict(family='Arial', size=12,
                                      color='rgb(255, 0, 0)'),
                            showarrow=False))

annotations.append(dict(xref='paper', yref='paper',
                        x=-0.2, y=-0.209,
                        text='Combined CMS and LEIE data' +
                             'to label the leading Fraudulent physician categories (15 Feb 2020)',
                        font=dict(family='Arial', size=10, color='rgb(150,150,150)'),
                        showarrow=False))

fig.update_layout(annotations=annotations)

fig.show()

By our definition, Anesthesiology, General Practice, Pain Management, and Getriatric Medicine have a high percentage of fraudulent services performed by excluded physicians.

Particularly, 65% of Pain Management cases were wrongfully performed (an overinflated statistic as we qualified ALL previous services performed by an excluded physician as fraudulent if they were on the list - the reality is unlikely for ALL). Regardless, it makes sense that the high flyers would be General Practice, Pain Mangement, and Geriatric Medicine (the elderlies are the most vulnerable).

In [45]:
fig = go.Figure()

#dict for layout
col_layout_dict = {'female_count': ['Female Physicians ','#ffcdd2'],
                 'male_count': ['Male Physicians','#A2D5F2']}

for col in ['female_count','male_count']:
    fig.add_trace(go.Bar(
        y=type_df['provider_type'],
        x=type_df[col],
        name=col_layout_dict[col][0],
        marker=dict(
            color=col_layout_dict[col][1],
        ),
        orientation='h',
    ))

fig.update_layout(
    barmode = 'stack',
    title = 'Top 15 Fraudulent Medicare Category by Service Count and Gender',
    paper_bgcolor='white',
    plot_bgcolor='white',
    yaxis=dict(
        showgrid=False,
        showline=False,
        showticklabels=True,
        domain=[0, 0.90],
    ),
    xaxis=dict(
        zeroline=False,
        showline=False,
        showticklabels=True,
        showgrid=True,
        domain=[0, 0.90],
    ),
    xaxis_title_text='Count',
)

fig.show()

The top fraudulent fields/categories are dominated by Male physicians.

Clinical Laboratory and Ambulance Service Provider has a lot of line services but require fewer doctors - so they did not have enough quantity to show on the bar plot?

5. Prep Data for Training

  • Random undersample data with 50:50 class distributions (reduce the majority class to minority until 1:1)
  • One hot encoding
In [46]:
partB_df.columns
Out[46]:
Index(['npi', 'provider_type', 'nppes_provider_city', 'nppes_provider_zip',
       'nppes_provider_state', 'nppes_provider_country', 'hcpcs_code',
       'hcpcs_description', 'hcpcs_drug_indicator', 'place_of_service',
       'nppes_provider_gender', 'line_srvc_cnt', 'bene_unique_cnt',
       'bene_day_srvc_cnt', 'average_submitted_chrg_amt',
       'average_medicare_payment_amt', 'year', 'excl_year', 'fraudulent',
       'fraud_line_srvc_cnt', 'fraud_average_submitted_chrg_amt',
       'fraud_average_medicare_payment_amt'],
      dtype='object')
In [47]:
#drop variables that we are not using
partB_train_df = partB_df.drop(columns=[
    'fraud_average_medicare_payment_amt', #for visual only
    'fraud_average_submitted_chrg_amt', #for visual only
    'fraud_line_srvc_cnt', #for visual only
    'year',
    'excl_year',
    'hcpcs_drug_indicator',
    'place_of_service',
    'nppes_provider_city',
    'nppes_provider_country',
    'hcpcs_description',
    'nppes_provider_zip'
])
In [152]:
a = partB_train_df.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are: {0}% of all data'.format(str(np.round((a[1]/a[0])*100,decimals = 6)))) 
0    35359427
1        8867
Name: fraudulent, dtype: int64
Fraudulent physicians are: 0.025077% of all data
In [49]:
display(partB_train_df.head())
print("We have 9 features, including 4 categorical variables in \n {0}".format(partB_train_df.columns[1:5].tolist()))
npi provider_type nppes_provider_state hcpcs_code nppes_provider_gender line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt fraudulent
1 1003000126 Internal Medicine MD 99222 M 115.0 112.0 115.0 199.0 108.115652 0
2 1003000126 Internal Medicine MD 99223 M 93.0 88.0 93.0 291.0 158.870000 0
3 1003000126 Internal Medicine MD 99231 M 111.0 83.0 111.0 58.0 30.720721 0
4 1003000126 Internal Medicine MD 99232 M 544.0 295.0 544.0 105.0 56.655662 0
5 1003000126 Internal Medicine MD 99233 M 75.0 55.0 75.0 150.0 81.390000 0
We have 9 features, including 4 categorical variables in 
 ['provider_type', 'nppes_provider_state', 'hcpcs_code', 'nppes_provider_gender']
In [146]:
def undersample(df, category, split=0.5):
    '''Random undersampling based on a category and class split (% of dataset belonged to majority class)'''
    fraud_number = df[category].sum()
    non_fraud_indices = df[df[category] == 0].index #get indices of non-fraud
    random_indices = np.random.choice(non_fraud_indices, size=int(round(fraud_number*(split/(1-split)), 0)), 
                                      replace=False) #sample at random without replacement
    
    fraud_indices = df[df[category] == 1].index #get indices of fraud
    under_sample_indices = np.concatenate([fraud_indices, random_indices])
    return df.loc[under_sample_indices]

data = undersample(partB_train_df, 'fraudulent')

#print
a = data.fraudulent.value_counts()
display(a)
print('Fraudulent physicians are {0}% of all data.'.format(str(np.round((a[1]/(a[0]+a[1]))*100,decimals = 2)))) 
1    8867
0    8867
Name: fraudulent, dtype: int64
Fraudulent physicians are 50.0% of all data.
In [144]:
# # We are NOT normalizing the data
# from sklearn import preprocessing

# def scale_predictors(df):
#     '''Takes in df and returns normalized data [0,1] '''
#     x  = df.values #numpy array
#     min_max_scaler = preprocessing.MinMaxScaler()
#     x_scaled = min_max_scaler.fit_transform(x)
#     return pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
In [147]:
# One-hot encoding
def one_hot_encode(df, cols):
    '''Encodes categorical variables in df with cols as an array'''
    for col in cols:
        df = pd.concat([df, pd.get_dummies(df[col], drop_first= True)], axis=1)
        return df

cols = ['provider_type', 'nppes_provider_state', 'hcpcs_code', 'nppes_provider_gender']
data = one_hot_encode(data, cols)
data = data.drop(columns=cols, axis=1).reset_index(drop=True) #drop column that's been encoded
In [148]:
display(data.shape)
print('There are {0} predictors after turning categories into dummy variables.'.format(data.shape[1]-2)) 
(17734, 106)
There are 104 predictors after turning categories into dummy variables.

6. Model:

In [149]:
data.head()
Out[149]:
npi line_srvc_cnt bene_unique_cnt bene_day_srvc_cnt average_submitted_chrg_amt average_medicare_payment_amt fraudulent Allergy/ Immunology Allergy/Immunology Ambulance Service Provider ... Rheumatology Sleep Medicine Speech Language Pathologist Sports Medicine Surgical Oncology Thoracic Surgery Undefined Physician type Unknown Physician Specialty Code Urology Vascular Surgery
0 1003870239 14.0 13.0 14.0 265.428571 62.630000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1003870239 610.0 91.0 610.0 648.000000 236.548836 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1003870239 217.0 91.0 217.0 540.000000 200.677419 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1003870239 24.0 19.0 24.0 430.000000 154.030000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1003870239 11.0 11.0 11.0 220.000000 140.120000 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 106 columns

In [157]:
data.fraudulent.value_counts()
Out[157]:
1    8867
0    8867
Name: fraudulent, dtype: int64
In [159]:
import statsmodels.api as sm

cols = [col for col in data.columns if col not in ['npi', 'fraudulent']]
X = data[cols]
y = data['fraudulent']

logit_model=sm.Logit(y,X)
result=logit_model.fit(method='bfgs')
print(result.summary2())
/usr/local/lib/python3.7/site-packages/statsmodels/discrete/discrete_model.py:1747: RuntimeWarning:

overflow encountered in exp

/usr/local/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning:

Maximum Likelihood optimization failed to converge. Check mle_retvals

Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.566455
         Iterations: 35
         Function evaluations: 55
         Gradient evaluations: 45
                                        Results: Logit
===============================================================================================
Model:                        Logit                      Pseudo R-squared:           0.183     
Dependent Variable:           fraudulent                 AIC:                        20299.0405
Date:                         2020-02-23 02:17           BIC:                        21108.4973
No. Observations:             17734                      Log-Likelihood:             -10046.   
Df Model:                     103                        LL-Null:                    -12292.   
Df Residuals:                 17630                      LLR p-value:                0.0000    
Converged:                    0.0000                     Scale:                      1.0000    
-----------------------------------------------------------------------------------------------
                                                Coef.  Std.Err.    z     P>|z|   [0.025  0.975]
-----------------------------------------------------------------------------------------------
line_srvc_cnt                                   0.0000   0.0000   1.0707 0.2843 -0.0000  0.0001
bene_unique_cnt                                -0.0014   0.0001  -9.5802 0.0000 -0.0017 -0.0011
bene_day_srvc_cnt                               0.0008   0.0001   9.0011 0.0000  0.0006  0.0010
average_submitted_chrg_amt                     -0.0000   0.0000  -1.1687 0.2425 -0.0001  0.0000
average_medicare_payment_amt                   -0.0003   0.0002  -1.5382 0.1240 -0.0007  0.0001
Allergy/ Immunology                             0.0316   0.5563   0.0568 0.9547 -1.0588  1.1220
Allergy/Immunology                              0.0921   0.2704   0.3405 0.7335 -0.4380  0.6221
Ambulance Service Provider                      0.0420   0.4913   0.0856 0.9318 -0.9208  1.0049
Ambulance Service Supplier                      0.0623   0.2699   0.2307 0.8175 -0.4668  0.5913
Ambulatory Surgical Center                     -0.5103   0.3102  -1.6449 0.1000 -1.1182  0.0977
Anesthesiologist Assistants                    -0.0497   1.0013  -0.0496 0.9604 -2.0122  1.9128
Anesthesiology                                  0.6062   0.0867   6.9946 0.0000  0.4363  0.7760
Audiologist                                     0.2161   0.4605   0.4693 0.6389 -0.6864  1.1186
Audiologist (billing independently)             0.3049   0.3077   0.9908 0.3218 -0.2982  0.9080
CRNA                                            0.0487   0.1910   0.2550 0.7987 -0.3257  0.4232
Cardiac Electrophysiology                       0.1497   0.2660   0.5627 0.5736 -0.3717  0.6710
Cardiac Surgery                                -0.2236   0.4666  -0.4792 0.6318 -1.1382  0.6910
Cardiology                                      0.3086   0.0693   4.4538 0.0000  0.1728  0.4444
Cardiovascular Disease (Cardiology)             0.3040   0.1322   2.2993 0.0215  0.0449  0.5632
Centralized Flu                                -0.1458   0.5828  -0.2501 0.8025 -1.2881  0.9965
Certified Clinical Nurse Specialist            -0.0626   0.6674  -0.0939 0.9252 -1.3708  1.2455
Certified Nurse Midwife                        -0.0254   1.4146  -0.0179 0.9857 -2.7980  2.7472
Certified Registered Nurse Anesthetist (CRNA)  -0.1977   0.2936  -0.6734 0.5007 -0.7732  0.3777
Chiropractic                                   -0.3234   0.2036  -1.5886 0.1121 -0.7224  0.0756
Clinical Cardiatric Electrophysiology          -0.1952   0.5042  -0.3872 0.6986 -1.1834  0.7930
Clinical Laboratory                             1.8945   0.1208  15.6850 0.0000  1.6578  2.1312
Clinical Psychologist                           0.5464   0.2419   2.2591 0.0239  0.0724  1.0204
Colorectal Surgery (Proctology)                -0.0244   1.4175  -0.0172 0.9863 -2.8027  2.7539
Colorectal Surgery (formerly proctology)       -0.0975   0.7104  -0.1372 0.8909 -1.4898  1.2949
Critical Care (Intensivists)                   -0.2441   0.4519  -0.5401 0.5891 -1.1298  0.6416
Dermatology                                    -0.4104   0.1233  -3.3281 0.0009 -0.6520 -0.1687
Diagnostic Radiology                           -5.4303   0.4418 -12.2904 0.0000 -6.2963 -4.5643
Emergency Medicine                             -0.2812   0.1086  -2.5889 0.0096 -0.4942 -0.0683
Endocrinology                                  -0.3232   0.1982  -1.6303 0.1030 -0.7117  0.0654
Family Practice                                 0.0716   0.0450   1.5923 0.1113 -0.0165  0.1597
Gastroenterology                               -1.4297   0.1789  -7.9930 0.0000 -1.7803 -1.0791
General Practice                                1.2783   0.1465   8.7269 0.0000  0.9912  1.5654
General Surgery                                -0.1163   0.1307  -0.8895 0.3737 -0.3725  0.1399
Geriatric Medicine                              2.1242   0.1808  11.7459 0.0000  1.7698  2.4787
Geriatric Psychiatry                           -0.0133   2.0007  -0.0067 0.9947 -3.9346  3.9080
Gynecological Oncology                         -0.0252   1.4149  -0.0178 0.9858 -2.7984  2.7481
Gynecological/Oncology                         -0.0749   0.8181  -0.0916 0.9270 -1.6783  1.5284
Hand Surgery                                   -0.2164   0.4773  -0.4534 0.6503 -1.1519  0.7191
Hematology                                     -0.0874   0.7575  -0.1154 0.9082 -1.5720  1.3972
Hematology-Oncology                            -0.1865   0.2966  -0.6287 0.5295 -0.7679  0.3949
Hematology/Oncology                            -0.3422   0.1683  -2.0332 0.0420 -0.6721 -0.0123
Hospice and Palliative Care                    -0.0248   1.4172  -0.0175 0.9860 -2.8026  2.7529
Independent Diagnostic Testing Facility         1.0482   0.1840   5.6966 0.0000  0.6876  1.4089
Independent Diagnostic Testing Facility (IDTF)  0.4534   0.2699   1.6803 0.0929 -0.0755  0.9823
Infectious Disease                             -0.3665   0.3689  -0.9936 0.3204 -1.0896  0.3565
Internal Medicine                               0.6462   0.0417  15.4901 0.0000  0.5645  0.7280
Interventional Cardiology                      -0.3809   0.3587  -1.0619 0.2883 -1.0840  0.3221
Interventional Pain Management                  0.8363   0.1702   4.9131 0.0000  0.5027  1.1699
Interventional Radiology                       -0.3430   0.3796  -0.9034 0.3663 -1.0871  0.4011
Licensed Clinical Social Worker                 0.1473   0.2673   0.5511 0.5815 -0.3766  0.6713
Mass Immunization Roster Biller                -0.3132   0.3992  -0.7846 0.4327 -1.0955  0.4692
Mass Immunizer Roster Biller                   -0.1607   0.5578  -0.2881 0.7733 -1.2540  0.9326
Medical Oncology                               -0.5437   0.3073  -1.7691 0.0769 -1.1461  0.0587
Nephrology                                     -1.1956   0.2242  -5.3320 0.0000 -1.6350 -0.7561
Neurology                                       1.4032   0.1135  12.3661 0.0000  1.1808  1.6256
Neurosurgery                                   -0.4812   0.3107  -1.5491 0.1214 -1.0901  0.1276
Nuclear Medicine                               -0.0877   0.7575  -0.1157 0.9079 -1.5723  1.3970
Nurse Practitioner                             -0.5073   0.0892  -5.6886 0.0000 -0.6820 -0.3325
Obstetrics & Gynecology                        -0.1710   0.3071  -0.5569 0.5776 -0.7729  0.4309
Obstetrics/Gynecology                          -0.6267   0.1909  -3.2834 0.0010 -1.0008 -0.2526
Occupational Therapist in Private Practice     -0.0682   0.8999  -0.0757 0.9396 -1.8320  1.6957
Occupational therapist                         -0.1233   0.6774  -0.1821 0.8555 -1.4511  1.2044
Ophthalmology                                  -1.7315   0.1680 -10.3078 0.0000 -2.0607 -1.4023
Optometry                                      -0.6098   0.1293  -4.7157 0.0000 -0.8632 -0.3564
Oral Surgery (Dentists only)                   -0.0375   1.1559  -0.0324 0.9741 -2.3030  2.2280
Oral Surgery (dentists only)                   -0.0127   2.0005  -0.0063 0.9950 -3.9335  3.9082
Orthopedic Surgery                             -0.2264   0.0982  -2.3057 0.0211 -0.4189 -0.0340
Osteopathic Manipulative Medicine              -0.0376   1.1557  -0.0326 0.9740 -2.3027  2.2274
Otolaryngology                                  0.1458   0.1386   1.0518 0.2929 -0.1259  0.4176
Pain Management                                 0.8223   0.1909   4.3080 0.0000  0.4482  1.1964
Pathology                                      -1.6405   0.2096  -7.8274 0.0000 -2.0513 -1.2297
Pediatric Medicine                              0.4319   0.2928   1.4753 0.1401 -0.1419  1.0058
Peripheral Vascular Disease                    -0.0123   2.0037  -0.0062 0.9951 -3.9395  3.9148
Physical Medicine and Rehabilitation            0.4833   0.1457   3.3161 0.0009  0.1976  0.7689
Physical Therapist                             -0.8942   0.1483  -6.0315 0.0000 -1.1848 -0.6036
Physical Therapist in Private Practice         -0.7058   0.2254  -3.1321 0.0017 -1.1475 -0.2641
Physician Assistant                            -0.7674   0.1152  -6.6630 0.0000 -0.9931 -0.5417
Plastic and Reconstructive Surgery             -0.2478   0.4438  -0.5584 0.5766 -1.1176  0.6220
Podiatry                                        1.5335   0.0873  17.5703 0.0000  1.3624  1.7045
Portable X-Ray Supplier                        -0.0624   0.8977  -0.0695 0.9446 -1.8218  1.6970
Portable X-ray                                 -0.0748   0.8203  -0.0911 0.9274 -1.6825  1.5329
Preventive Medicine                            -0.0376   1.1563  -0.0325 0.9740 -2.3039  2.2287
Psychiatry                                      0.2690   0.1369   1.9656 0.0493  0.0008  0.5372
Psychologist (billing independently)            0.1542   0.4844   0.3184 0.7502 -0.7952  1.1037
Psychologist, Clinical                          0.1383   0.3533   0.3913 0.6955 -0.5542  0.8307
Pulmonary Disease                              -1.3563   0.2162  -6.2744 0.0000 -1.7800 -0.9327
Radiation Oncology                             -0.8469   0.2533  -3.3440 0.0008 -1.3432 -0.3505
Radiation Therapy                              -0.0128   2.0004  -0.0064 0.9949 -3.9335  3.9080
Registered Dietician/Nutrition Professional    -0.0255   1.4146  -0.0180 0.9856 -2.7980  2.7471
Rheumatology                                   -0.4658   0.2348  -1.9833 0.0473 -0.9261 -0.0055
Sleep Medicine                                 -0.0128   2.0002  -0.0064 0.9949 -3.9331  3.9076
Speech Language Pathologist                     0.0322   0.9095   0.0354 0.9717 -1.7504  1.8149
Sports Medicine                                -0.0996   0.7092  -0.1404 0.8883 -1.4896  1.2904
Surgical Oncology                              -0.0613   0.8966  -0.0684 0.9455 -1.8186  1.6960
Thoracic Surgery                               -0.3350   0.3935  -0.8513 0.3946 -1.1063  0.4363
Undefined Physician type                       -0.0127   2.0005  -0.0063 0.9950 -3.9335  3.9082
Unknown Physician Specialty Code               -0.0377   1.1556  -0.0327 0.9739 -2.3026  2.2271
Urology                                        -1.2770   0.1627  -7.8479 0.0000 -1.5959 -0.9581
Vascular Surgery                               -0.1695   0.2029  -0.8354 0.4035 -0.5672  0.2282
===============================================================================================

Future Steps

1) Split train vs. test set - understand how well a model performs and its evaluation metrics (precision, recall, AUC).

2) Here, we only tried a 50:50 class distribution for fraud vs. non fraud. It will be worth to experiment with different class splits and compare results.

3) Try other classification models that previous literature has found to be productive (XGBoost, SVM).

4) Try a neural net.

References